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Abstract A fundamental aspect of biological information processing is the ubiquity of sequence- 
function relationships - functions that map the sequence of DNA, RNA, or protein to a biochemically 
relevant activity. Most sequence-function relationships in biology are quantitative, but only recently 
have experimental techniques for effectively measuring these relationships been developed. The advent 
of such “massively parallel” experiments presents an exciting opportunity for the concepts and methods 
of statistical physics to inform the study of biological systems. After reviewing these recent experimental 
advances, we focus on the problem of how to infer parametric models of sequence-function relationships 
from the data produced by these experiments. Specifically, we retrace and extend recent theoretical 
work showing that inference based on mutual information, not the standard likelihood-based approach, 
is often necessary for accurately learning the parameters of these models. Closely connected with this 
result is the emergence of “diffeomorphic modes” - directions in parameter space that are far less 
constrained by data than likelihood-based inference would suggest. Analogous to Goldstone modes 
in physics, diffeomorphic modes arise from an arbitrarily broken symmetry of the inference problem. 
An analytically tractable model of a massively parallel experiment is then described, providing an 
explicit demonstration of these fundamental aspects of statistical inference. This paper concludes with 
an outlook on the theoretical and computational challenges currently facing studies of quantitative 
sequence-function relationships. 


1 Introduction 

A major long-term goal in biology is to understand how biological function is encoded within the 
sequences of DNA, RNA, and protein. The canonical success story in this effort is the genetic code: 
given an arbitrary sequence of messenger RNA, the genetic code allows us to predict with near certainty 
what peptide sequence will result. There are many other biological codes we would like to learn as well. 
How does the DNA sequence of a promoter or enhancer encode transcriptional regulatory programs? 
How does the sequence of pre-mRNA govern which exons are kept and which are removed from the 
final spliced mRNA? How does the peptide sequence of an antibody govern how strongly it binds to 
target antigens? 

A major difference between the genetic code and these other codes is that while the former is 
qualitative in nature, the latter are governed by sequence-function relationships that are inherently 
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Fig. 1: Sequence-function relationships in biology, (a) A sequence-function relationship maps a bio¬ 
logical sequence (blue bar) to a biologically relevant activity (yellow star), (b) One of the simplest 
sequence-function relationships is how the affinity (star) of a transcription factor protein (magenta) 
for its DNA binding site depends on the sequence of that site (blue), (c) A more complicated sequence- 
function relationship describes how the rate of mRNA transcription depends on the DNA sequence 
of a gene’s promoter region. At the lac promoter of E. coli (illustrated), this transcription rate (star) 
depends on how strongly both the transcription factor CRP (purple) and the RNA polymerase holoen- 
zyme (RNAP; orange) bind their respective sites within the promoter region (blue). 


quantitative. Quantitative sequence-function relationshipsQ describe any function that maps the se¬ 
quence of a biological heteropolymer to a biologically relevant activity (Fig. la). Perhaps the simplest 
example of such a relationship is how the affinity of a transcription factor protein for its DNA binding 
site depends on the DNA sequence of that site (Fig. lb). Such relationships are a key component of 
the more complicated relationship between the DNA sequence of a promoter or enhancer (which typi¬ 
cally binds multiple proteins) and the resulting rate of mRNA transcription (Fig. Ic). In both of these 
cases, the activities of interest (affinity or transcription rate) can vary over orders of magnitude and 
yet still be finely tuned by adjusting the corresponding sequence (binding site or promoter/enhancer). 
Similarly other sequence-function relationships, like the inclusion of exons during mRNA splicing or 
the affinity of a protein for its ligand, are fundamentally quantitative. 

The study of quantitative sequence-function relationships presents an exciting opportunity for the 
concepts and methods of statistical physics to shed light on biological systems. There is a natural 
analogy between biological sequences and the microstates of physical systems, as well as between 
biological activities and physical Hamiltonians. Yet we currently lack answers to basic questions a 
statistical physicist might ask, such as “what is the density of states?” or “is a relationship convex 
or glassy?” The answers to such questions may well have important consequences for diverse fields 
including biochemistry, systems biology, immunology, and evolution. 

Experimental methods for measuring sequence-function relationships have improved dramatically in 
recent years. In the mid 2000s, multiple “high-throughput” methods for measuring the DNA sequence 
specificity of transcription factors were developed; these methods include protein binding microarrays 
(PBMs) |UI2i A. coli one-hybrid technology (EIH) [1], and microfluidic platforms [S]. The subsequent 
development and dissemination of ultra-high-throughput DNA sequencing technologies then led, start¬ 
ing in 2009, to the creation of a number of “massively parallel” experimental techniques for probing 
a wide range of sequence-function relationships (Table I). These massively parallel assays can readily 
measure the functional activity of 10^ to 10® sequences in a single experiment by coupling standard 
bench-top techniques to ultra-high-throughput DNA sequencing. 

Massively parallel experiments are very unlike conventional experiments in physics: they are typi¬ 
cally very noisy and rarely provide direct readouts of the quantities that one cares about. Moreover, 
the noise characteristics of these measurements are difficult to accurately model. Indeed, such noise 


^ These have also called quantitative sequence-activity maps, or QSAMs [T]. 
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sequence 

activity 

system 

name 

publication 

DNA 

binding 

sites 

protein-DNA 

binding 

affinity 

purified 

protein 

Bind-n-Seq 

Zykovich et ah, 2009 (5) 

HT-SELEX 

Zhao et ah, 2009 0 


Jolma et ah, 2010 [5] 

EMSA-Seq 

Wong et ah, 2011 [9] 

SELEX-Seq 

Slattery et ah, 2011 |10| 

promoter/ 

enhancer 

DNA 

transcription 

rate 

purified protein 


Patwardhan et ah, 2009 [llj 

bacteria 

Sort-Seq 

Kinney et ah, 2010 [12] 

cell culture 

MPRA 

Melnikov et ah, 2012 [I] 

mouse liver 


Patwardhan et ah, 2012 |13| 

yeast 


Sharon et ah, 2012 |14| 

mouse retina 

CRE-Seq 

Kwasniesk et ah, 2012 |15| 

protein 

ligand binding 

phage display 

DMS 

Fowler et ah, 2010 [16] 

cellular growth rate 

yeast 

EMPIRIC 

Hietpas et ah, 2011 [Tfj 

toxin activity 

bacteria 


Adkar et ah, 2012 [IB] 

HlNl binding 

yeast display 


Whitehead et ah, 2012 |Tg] 

GPCR expression 

bacteria 


Schlinkmann et ah, 2012 [2D] 

RNA 

mRNA translation 

bacteria 


Holmqvist et ah, 2013 [21] 

sRNA targeting 

bacteria 

qSortSeq 

Peterman et ah, 2014 [22] 

mRNA translation 

cell culture 


Oikonomou et ah, 2014 [23] 

mRNA translation 

cell culture 

FACS-Seq 

Noderer et ah, 2014 [24] 

replication origins 

DNA replication 

yeast 

ARS-Seq 

Liachko et ah, 2013 [^ 

endonuclease sites 

DNA cutting 

purified protein 


Thyme et ah, 2014 |26] 


Table 1: Massively parallel experiments used for studying various sequence-function relationships. 
Columns show the type of sequences interrogated, the sequence activity assayed, the biological system 
on which the experiments were performed, the name (if any) of the experimental technique, and the 
publication first describing the method. This table is not exhaustive; it only describes some of the 
earliest experiments in each type of system. 


generally exhibits substantial day-to-day variability. Although standard inference methods require an 
explicit model of experimental noise, it is still possible to precisely learn quantitative sequence-function 
relationships from massively parallel data even when noise characteristics are unknown EZlllH]. 

The ability to fit parametric models to these data reflects subtle but important distinctions between 
two objective functions used for statistical inference: (i) likelihood, which requires a priori knowledge 
of the experimental noise function and (ii) mutual information |29j . a quantity based on the concept 
of entropy, which does not require a noise function. In contrast to the conventional wisdom that more 
experimental measurements will improve the model inference task, the standard maximum likelihood 
approach will typically never learn the right model, even in the infinite data limit, if one uses an 
imperfect model of experimental noise. Model inference based on mutual information does not suffer 
from this ailment. 

Mutual-information-based inference is unable to pin down the values of model parameters along 
certain directions in parameter space known as “diffeomorphic modes” |28j . This inability is not a 
shortcoming of mutual information, but rather reflects a fundamental distinction between how dif¬ 
feomorphic and nondiffeomorphic directions in parameter space are constrained by data. Analogous 
to the emergence of Goldstone modes in particle physics due to a specific yet arbitrary choice of 
phase, diffeomorphic modes arise from a somewhat arbitrary choice one must make when defining the 
sequence-dependent activity that one wishes to model. Likelihood, in contrast to mutual information, 
is oblivious to the distinction between diffeomorphic and nondiffeomorphic modes. 

We begin this paper by briefly reviewing a variety of massively parallel assays for probing quanti¬ 
tative sequence-function relationships. We then turn to the problem of learning parametric models of 
these relationships from the data that these experiments generate. After reviewing recent work on this 
problem [23], we extend this work in three ways. First, we show that “diffeomorphic modes” of the 
parametric activity model that one wishes to learn are “dual” to certain transformations of the corre- 
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spending model of experimental noise (the “noise function”). This duality reveals a symmetry of the 
inference problem, thereby establishing a close analogy with Goldstone modes. Next we compute and 
compare the Hessians of likelihood and mutual information. This comparisons suggests an additional 
analogy between this inference problem and concepts in fluid mechanics. Finally, we work through an 
analytically tractable model of a massively parallel experiment of protein-DNA binding. This example 
explicitly illustrates the differences between likelihood- and mutual-information-based approaches to 
inference, as well as the emergence of diffeomorphic modes. 

It should be noted that the inference of receptive fields in sensory neuroscience is another area 
of biology in which mutual information has proved useful as an objective function, and that work in 
this area has also provided important insights into basic aspects of machine learning [5Dl[5n[5^[551 
134] . Indeed, the problem of learning quantitative sequence-function relationships in molecular biology 
is very similar to the problem of learning receptive fields in neuroscience |28| . The discussion of this 
problem in the neuroscience context, however, has largely avoided in-depth analyses of how mutual 
information relates to likelihood, as well as of how diffeomorphic modes emerge. 


2 Massively parallel experiments probing sequence-function relationships 

All of the massively parallel experiments in Table 1 share a common structure (Fig. 2a). The first step 
in each experiment is to generate a large set of (roughly 10^ to 10®) different sequences to measure. 
This set of sequences is called the “library.” Multiple different types of libraries can be used depending 
on the application. One then performs an experiment that takes this library as input, and as output 
provides a set of one or more “bins” of sequences. Each output bin contains sequences selected from 
the library with a weight that depends on the measured activity of that sequence. Finally, a sample 
of sequences from each of the output bins, as well as from the input library, are determined using 
ultra-high-throughput DNA sequencing. The resulting data thus consists of a long list of (typically 
non-unique) DNA sequences, each assigned to a corresponding bin (Fig. 2b). It is from these data that 
we wish to learn quantitative models of sequence-function relationships. 

Some of the earliest massively parallel experiments were designed to measure the specificity of 
purified transcription factors for their DNA binding sites [SlFfllSlI^lfTU] (Fig. 2c). The library used in 
such studies consists of a fixed-length region of random DNA flanked by constant sequences used for 
PGR amplification. This library is mixed with the transcription factor of interest, after which protein- 
bound DNA is separated from unbound DNA, e.g., by running the protein-DNA mixture on a gel. 
Protein-bound DNA is then sequenced, along with the input library. 

Using a library of random DNA to assay protein-DNA binding has the advantage that the same 
library can be used to study each protein. This is particularly useful when performing assays on 
many different proteins at once (e.g., [8ll35jl. On the other hand, only a very small fraction of library 
sequences will be specifically bound by the protein of interest. Moreover, because proteins typically 
bind DNA in a non-specific manner, such experiments are often performed serially in order to achieve 
substantial enrichment 0 

The first massively parallel experiment to probe how multi-protein-DNA complexes regulate tran¬ 
scription in living cells was Sort-Seq |12| (Fig. 2d). The sequence library used in this experiment was 
generated by introducing randomly scattered mutations into a “wild-type” sequence of interest, specif¬ 
ically, the 75 bp region of the promoter of the lac gene in E. coli depicted in Fig. 3a. A few million 
of these mutant promoters were cloned upstream of the green fluorescent protein (GFP) gene. Gells 
carrying these expression constructs were grown under conditions favorable to promoter activity and 
were then sorted into a small number of bins according to each cell’s measured fluorescence. This 
partitioning of cells was accomplished using fluorescence-activated cell sorting (FAGS) [H], a method 
that can readily sort ~ 10^ cells per second. The mutant promoters within each sorted bin as well as 
within the input library were then sequenced, yielding measurements for ^ 2 x 10® variant promoter 
sequences. We note that advances in DNA sequencing have since made it possible to accumulate much 
more data, and it is no long difficult to measure the activities of ~ 10^ different sequences in this 
manner. 

^ This serial e nrichment appro ach is known as SELEX and is much older than ultra-high-throughput DNA 
sequencing; see [H51l37ll38ll39U40| . 
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Fig. 2: Overview of massively parallel experiments for studying quantitative sequence-function relation¬ 
ships. (a) The input to each experiment is a library of different sequences that one wishes to test. The 
output is one or more bins of sequences; each sequence in each bin is randomly selected from the library 
with a weight that depends on a measurement of that sequence’s activity (star), (b) The resulting data 
set consists of a list of (non-unique) sequences, each sequence assigned to either the input library or 
one of the output bins, (c) Illustration of experimental methods for measuring the sequence-dependent 
binding energy of purified transcription factor proteins. The input library typically consists of random 
DNA flanked by constant sequence. This library DNA is mixed with the protein of interest and binding 
is allowed to come to equilibrium. DNA bound by protein is then separated from unbound DNA, e.g. 
by running complexes on a gel (shown), then sequenced along with a sample from the input library, (d) 
Sort-Seq [12] is a massively parallel experiment that uses a library of partially mutagenized sequences 
to probe the mechanisms of transcriptional regulation employed by a specihc wild type promoter of 
interest. Mutant promoters are cloned upstream of the GFP gene, and E. coli cells harboring these 
expression constructs are sorted into bins using FACS. The mutant promoters in each bin, as well as 
promoters from the input library, are then sequenced. 
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Fig. 3: The lac promoter region studied in [12]. (a) Sort-Seq was used to dissect a 75 bp region of 
the E. coli lac promoter using a library consisting of wild type sequences mutagenized at 12% per 
nucleotide, i.e., each library sequence had 9 mutations on average, (b) The resulting data were used 
to learn a quantitative sequence-function relationship, the mathematical form of which reflected an 
explicit biophysical model of transcriptional regulation. This model included two “energy matrices” 
describing the sequence-dependent binding energy of CRP (Q) and RNAP (P) to their respective sites. 
It also included a value for the interaction energy 7 between these two proteins. 


Massively parallel experiments using partially mutagenized sequences provide data about sequence- 
function relationships within a localized region of sequence space centered on the wild type sequence of 
interest. Measuring these local relationships can provide a wealth of information about the functional 
mechanisms of the wild type sequence. For instance, the Sort-Seq data of [T2| allowed the inference of 
an explicit biophysical model for how CRP and RNAP work together to regulate transcription at the 
lac promoter (Fig. 3b). In particular, the authors used their data to learn quantitative models for the 
in vivo sequence specificity of both CRP and RNAP. Model fitting also enabled measurement of the 
protein-protein interaction by which CRP is able to recruit RNAP and up-regulate transcription. 

Partially mutagenized sequences have also been used extensively for “deep mutational scanning” 
experiments on proteins, starting with nanzj. In this context, selection experiments on partially 
mutagenized proteins allow one to identify protein domains critical for folding and function. A variety 
of deep mutational scanning experiments are described in [42j . 


3 Inference using likelihood 

The inference of quantitative sequence-function relationships from massively parallel experiments can 
be phrased as follows. Data consists of a large number of sequences each sequence S having 

a corresponding measurement M. Due to experimental noise, repeated measurements of the same 
sequence S can yield different values for M. Our experiment therefore has the following probabilistic 
form form: 

experiment 

s Fm M , (1) 

sequence measurement 
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(a) joint likelihood 



(b) likelihood with fixed 
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Fig. 4: Schematic illustration of how likelihood L{6, tt) depends on the model 9 and the noise function 
TT in the TV —>■ oo limit. (a,b) L will typically have a correlated dependence on 9 and tt. If tt is set equal 
to the correct noise function tt*, then L will be maximized by the correct model 9*. However, if tt is 
set to an incorrect noise function tt', L will typically attain a maximum at an incorrect 9'. 


If we assume that the measurements for each sequence are independent, and if we have an explicit 
parametric form for p{M\S)^ then we can learn the values of the parameters by maximizing the per- 
datum log likelihood, 

1 ^ 

L=^Y.^ogp{M^\Sr.). ( 2 ) 

n—1 


In what follows we will refer to the quantity L simply as the “likelihood.” 

In regression problems such as this, one introduces an additional layer of structure. Specifically, 
we assume the measurement M of each sequence S' is a noisy readout of some underlying activity R 
that is a deterministic function of that sequence. We call the function relating R to S the “activity 
model” and denote it using 9{S). This activity model is ultimately what we want to understand. The 
specific way the activity R is read out by measurements M is then specified by a conditional probability 
distribution, 7r(M|i?), which we call the “noise function.”]^ Our experiment is thus represented by the 
Markov chain 


model 

s M. 

sequence 


R 

activity 


noise function 

7r(M|i?) 


M 


measurement 


( 3 ) 


The corresponding likelihood is 

1 ^ 

= j^Y.^og7r{M„\9iSr.)). (4) 

n—1 

The model we adopt for our experiment therefore has two components: 9, which describes the sequence- 
function relationship of interest, and tt, which we do not really care about. 

Standard statistical regression requires that the noise function tt be specified up-front, tt can be 
learned either by performing separate calibration experiments, or by assuming a functional form based 
on an educated guess. This can be problematic, however. Consider inference in the large data limit, 
iV —oo, which is illustrated in Fig. Likelihood is determined by both the model 9 and the noise 
function tt (Fig. &)■ If we know the correct noise function n* exactly, then maximizing L(0,7r*) over 
9 is guaranteed to recover the correct model 9*. However, if we assume an incorrect noise function tt', 
maximizing likelihood will typically recover an incorrect model 9' (Fig. i’)- 

^ We use the term “noise function” in order to be consistent with the terminology of [28] and to avoid 
deviating too much from the more standard terms “noise model” and “error model” used in the statistics and 
machine learning literature. We emphasize, however, that tt defines much more than just the characteristics of 
experimental noise; tt entirely specifies the relationship between measurements M and the underlying activity 
R. Were it not for prior terminology, the term “measurement function” might be preferable to “noise function.” 


















4 Inference using mutual information 


Information theory provides an alternative inference approach. Suppose we hypothesize a specific model 
0, which gives predictions R. Denote the true model 9* and the corresponding true activity R*. The 
dependence between S, M, R*, and R will then form a Markov chain, 

R - - S -- R* -- M. (5) 

From the simple fact that M depends on R only through the value of R*, any dependence measure T) 
that satisfies the Data Processing Inequality (DPI) must satisfy 

V[R]M\<V[R*]M\. (6) 

Therefore, in the set of possible models 9, the true model is guaranteed to globally maximize the 
objective function 'D{9) = M], 

One particularly relevant dependence measure that satisfies DPI is mutual information, a quantity 
that plays a fundamental role in information theory [29] For the massively parallel experiments such 
as those in Fig. 2, R is continuous and M is discrete. In these cases, mutual information is given by 

m = /|R; "I = E / ") >»g f(R)f'MY 

where p{M, R) is the joint distribution of activity predictions and measurements resulting from the 
model 0. If one is able to estimate p{M, R) from a finite sample of data, mutual information can be 
used as an objective function for determining 9 without assuming any noise function tt. 

It should be noted that there are multiple dependence measures T) that satisfy DPI. One might 
wonder whether maximizing multiple different dependence measures would improve on the optimization 
of mutual information alone. The answer is not so simple. In [28j it was shown that if the correct model 
9* is within the space of models under consideration, then, in the large data limit, maximizing mutual 
information is equivalent to simultaneously maximizing every dependence measure that satisfies DPI. 
On the other hand, one rarely has any assurance that the correct model 9* is within the space of 
parameterized models one is considering. In this case, considering different DPI-satisfying measures 
might provide a test for whether 9* is noticeably outside the space of parameterized models. To our 
knowledge, this potential approach to the model selection problem has yet to be demonstrated. 


5 Relationship between likelihood and mutual information 

A third inference approach is to admit that we do not know the noise function tt a priori, and to fit 
both 9 and tt simultaneously by maximizing L{9,tt) over this pair. It is easy to see why this makes 
sense: the division of the inference problem into first measuring tt, then learning 9 using that inferred 
TT, is somewhat artificial. The process that maps S' to M is determined by both 9 and tt and thus, 
from a probabilistic point of view, it makes sense to maximize likelihood over both of these quantities 
simultaneously. 

We now show that, in the large N limit, maximizing likelihood over both 9 and tt is equivalent to 
maximizing the mutual information between model predictions and measurements. Here we follow the 
argument given in [2H]- In the large N limit, likelihood can be written 

L(0,7r)=^^ j dRp{R, M) log tt{M\R) (8) 

M •' 

= I{9)-D{9,n)-H[M], (9) 

where 

D(.S, ^) = Y.J log (10) 


^ See [13] for an extended discussion of mutual information as a measure of statistical association. 
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is the Kullback-Leibler divergence between the assumed noise function tt and the observed noise func¬ 
tion p{M\R), and H[M] = — logp(M) is the entropy of the measurements, which does not 

depend on 9. To maximize L{9,Tr) it therefore suffices to maximize I(9) over 9 alone, then to set the 
noise function Tr{M\R) equal to the empirical noise function p{M\R), which causes D{9,tt) to vanish. 

Thus, when we are uncertain about the noise function tt, we need not despair. We can, if we like, 
simply learn tt at the same time that we learn 9. We need not explicitly model tt in order to do this; 
it suffices instead to maximize the mutual information I(9) over 9 alone. 

The connection between mutual information and likelihood can further be seen in a quantity called 
the “noise-averaged” likelihood. This quantity was first described for the analysis of microarray data 
[22!; see also [28]. The central idea is to put an explicit prior on the space of possible noise functions, 
then compute likelihood after marginalizing over these noise functions. Explicitly, the per-datum log 
noise-averaged likelihood Lna{9) is related to L{9,Tr) via 

We will refer to Lna simply as “noise-averaged likelihood” in what follows. 

Under fairly general conditions, one finds that noise-averaged likelihood is related to mutual infor¬ 
mation via 


Laa{9) = I{9) - A{9) - H[M]. (12) 

Here, the effect of the noise function prior p(7r) is absorbed entirely by the term A{9). Under very weak 
assumptions, A{9) vanishes in the fV —)■ oo limit and thus pijr) becomes irrelevant for the inference 
problem [271128] . 


6 Diffeomorphic modes 

Mutual information has a mathematical property that is important to account for when using it as an 
objective function: the mutual information between any two variables is unchanged by an invertible 
transformation of either variable. So if a change in model parameters, 0—^0', results in changes in 
model predictions R ^ R' that preserves the rank order of these predictions, then 

I{9) = I[M- R] = J[M; R'] = I{9'), (13) 

and 9 and 9' are judged to be equally valid. 

By using mutual information as an objective function, we are therefore unable to constrain any 
parameters of 9 that, if changed, produce invertible transformations of model predictions. Such param¬ 
eters are called “diffeomorphic parameters” or “diffeomorphic modes” [55]. The distinction between 
diffeomorphic modes and nondiffeomorphic modes is illustrated in Fig. [^ 


6.1 Criterion for diffeomorphic modes 

Following J28j , we now derive a criterion that can be used to identify all of the diffeomorphic modes of 
a model 0|d Consider an infinitesimal change in model parameters 9 ^ 9 d9, where the components 

of d9 are specified by 


d9i = evi (14) 

for small epsilon and for some vector Vi in 0-space. This change in 0 will produce a corresponding 
change in model predictions R ^ R + dR, where 

f)R 

i 

® Here, as throughout this paper, we restrict our atte ntion to situations in which i? is a scalar. The case of 
vector-valued model predictions R is worked out in | 28| . 
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Fig. 5: Illustration of diffeomorphic and nondiffeomorphic modes, (a) A diffeomorphic mode at a 
point 9 in parameter space is a vector that will (regardless of the underlying data) be tangent to a level 
curve of I{9). All other vectors (e.g., correspond to nondiffeomorphic modes, (b) Moving 9 along 
a nondiffeomorphic mode results in a sort of “diffusion” in which the R values assigned to different 
sequences change rank order. Here, the probability distribution p{R\M) is illustrated (for fixed M) in 
gray. The motion of individual R values upon such a change in 9 are indicated by arrows, (c) Changing 
9 along a diffeomorphic mode, however, results in a “flow” of R values that maintains their rank order. 


In general, the derivative dRjd9i can have arbitrary dependence on the underlying sequence S. This 
transformation will preserve the rank order of i?-values only if dR is the same for all sequences having 
the same value of R. The change dR must therefore be a function of R and have no other dependence 
on S. A diffeomorphic mode is a vector field that has this property at all points in parameter 

space. Specifically, a vector field z;'^‘^(6i) is a diffeomorphic mode if and only if there is a function h{R, 9) 
such that 

f) R 

(16) 


6.2 Diffeomorphic modes of linear models 

As a simple example, consider a situation in which each sequence S' is a D-dimensional vector and R 
is an affine function of S, i.e. 


D 

R = 9o + ^ ^ 9iSi, 

i=l 


for model parameters 9 = {9o,9i 


.. ,9z)}. The criterion in Eq. (161 then gives 


D 

vf{9) + Y,vf{9)S, = h{R,9). 

2 = 1 


(17) 


(18) 


Because the left hand side is linear in S and R is linear in S, the function h{R, 9) must be linear in R. 
Thus, h must have the form 


h{R,9) = a{9) + b{9)R 

for some functions a{9) and h{9). The corresponding diffeomorphic mode is 


„dif 


{9) = 


(19) 


a{9) 1 = 0 

b{9)9, f = l,2,...,D’ 


( 20 ) 
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which has two degrees of freedom. Specifically, the a component of corresponds to adding a constant 
to R while the b component corresponds to multiplying i? by a constant. 

Note that if we had instead chosen R = he. left out the constant component 6 *o, then 

there would be only one diffeomorphic mode, corresponding to multiplication of i? by a constant. This 
fact will be used when we analyze the Gaussian selection model in Section 8 . 


6.3 Diffeomorphic modes of a biophysical model of transcriptional regulation 


Diffeomorphic modes can become less trivial in more complicated situations. Consider the biophysical 
model of transcriptional regulation by the E. coli lac promoter (Fig. [^. This model was fit to Sort-Seq 
data in [H]. The form of this model is as follows. Let S denote a 4 x D matrix representing a DNA 
sequence of length D and having elements 

j 1 if base 6 occurs at position Z 
'’'”10 otherwise 


where b G {A, C,G,T} and I = 1,2,... D. The binding energy Q of CRP to DNA was modeled in [T^ 
as an “energy matrix”: each position in the DNA sequence was assumed to contribute additively to 
the overall energy. Specifically, 

Q = Y,&QSbi + 0°Q, ( 22 ) 

b,l 

where 6q = are the parameters of this energy matrix. Similarly, the binding energy P of 

RNAP to DNA was modeled as 

P = J2SpSbi + 0°p. (23) 

b,l 


Both energies were taken to be in thermal units {ksT). The rate of transcription R resulting from 
these binding energies was assumed to be proportional to the occupancy of RNAP at its binding site. 
This is given by 


a-P 


R = R„ 


o-P-Q-1 


1 + e ^ + e ^ + e ^ ^ 


(24) 


where 7 is the interaction energy between CRP and RNAP (again in units of ksT). 

Because the binding sites for CRP and RNAP do not overlap, one can learn the parameters Oq 
and Op from data separately by independently maximizing I[Q; M] and I[P', M]. Doing this, however, 
leaves undetermined the overall scale of each energy matrix as well as the chemical potentials Op and 
Oq. The reason is that the energy scale and chemical potential are diffeomorphic modes of energy 
matrix models and therefore cannot be inferred by maximizing mutual information. 

However, if Q and P are inferred together by maximizing I[R', M] instead, one is now able to learn 
both energy matrices with a physically meaningful energy scale. The chemical potential of CRP, Oq, is 
also determined. The only parameters left unspecified are the chemical potential of RNA polymerase, 
^p, and the maximal transcription rate i?max- The reason for this is that in the formula for R in Eq. 
( |24[ ) the energies P and Q combine in a nonlinear way. This nonlinearity eliminates three of the four 
diffeomorphic modes of P and Q[^See [28] for the derivation of this result. 


6.4 Dual modes of the noise function 

Diffeomorphic transformations of model parameters can be thought of as being equivalent to certain 
transformations of the noise function tt. Consider the transformation of model parameters 

Oi ^ 0^ = Oi + cvi, (25) 


The one additional diffeomorphic mode is created by the introduction of the parameter Rmax- 
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diffeomorphic modes of 9 
& dual modes of vr 


Fig. 6: Venn diagram illustrating the degrees of freedom of the likelihood L{6, tt) considered over all 
possible data sets M„}. Altering the model parameters 6 will typically change L{9, tt) in a way that 
cannot be recapitulated by changes in the noise function tt. Similarly, changes in tt cannot typically be 
imitated by changes in 9. However, diffeomorphic transformations of 9 will affect A(0, tt) in the exact 
same way that dual transformation of tt will. The diffeomorphic modes of 9 and the dual modes of tt 
can therefore be thought of as lying within the intersection of 9 and tt. 


where e is an infinitesimal number and Vi is a vector in 0-spacej^For any sequence S', this transformation 
induces a transformation of the model prediction 

f)R 

= + —. (26) 
i 

To see the effect this transformation has on likelihood, we rewrite Eq. Q as, 

L{9,Tr) = (log7r(M|i?)) 

data ’ (27) 


wliere 

sequences S„ 


indicates an average taken over the measurements Mn and predictions Rn for all of the 
in the data set. The change in likelihood resulting from Eq. (261 is therefore given by 


L{9', tt) = L{9, tt) + e 


/ d\ogTT{M\R) 
\ dR 



(28) 


Now suppose that there is a noise function tt' that has an equivalent effect on likelihood, i.e., 


L(0',7r) = L(0,7r') + O(e2), 


(29) 


for all possible data sets We say that this transformation of the noise function tt —)■ tt' 

is “dual” to the transformation 9 ^ 9' ol model parameters. The transformed noise function will 
necessarily have the form 


log7r'(M|i?) = log7r(M|i?) + ew(M, i?) (30) 

for some function v{M,R). To determine v we consider the transformation of likelihood induced by 
TT ^ tt': 


L{9, tt') = L{9, tt) + e {i{M, i?))data ■ 


(31) 


^ For the sake of clarity we suppress the ^-dependence of and h{R) in what follows. 
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Comparing Eq. (28) and Eq. (31), we see that tt —)■ tt' will be dual to 9 ^ 6' for all possible data sets 
if and only if 


aiog 7 r(M|i?) ^ dR 


dR 




(32) 


for all sequences S. 

For general choice of vector v, no function v will exist that satisfies Eq. (32). The reason is that 
dR/dOi will typically depend on the sequence S independently of the value of R. In other words, for 
a fixed value of M and i?, the left hand side of Eq. (32) will retain a dependence on S. The right 
hand side, however, cannot have such a dependence. The converse is also true: for general choice of the 
function u, no vector v will exist such that Eq. (32) is satished for all sequences. This is evident from 
the simple fact that u is a finite dimensional vector while u is a function of the continuous quantity R 
and therefore has an infinite number of degrees of freedom. 

In fact, Eq. (32) will have a solution if and only if 




dR 


,.dif 


h[R) 


(33) 


for some function h. Here we have added the superscript “dif” because this is precisely the definition 
of a diffeomorphic mode given in Eq. (16). In this case, the function dual to this diffeomorphic 
mode is seen to be 


R) 


d\ogTr{M\R) 
-TH- 


(34) 


These findings are summarized by the Venn diagram in Fig. Arbitrary transformations of the 
model parameters 9 will alter likelihood in a way that cannot be imitated by any change to the noise 
function tt. The reverse is also true: most changes to tt cannot be imitated by a corresponding change 
in 9. However, a subset of transformations of 9 are equivalent to corresponding dual transformations 
of TT. These transformations are precisely the diffeomorphic transformations of 9. This partial duality 
between 9 and tt has a simple interpretation: the choice of how we parse an experiment into an activity 
model 9 and a noise function tt is not unique. The ambiguity in this choice is parameterized by the 
diffeomorphic modes of 9 and the dual modes of tt. 


7 Error bars from likelihood, mutual information, and noise-averaged likelihood 

We now consider the consequences of performing inference using various objective functions at large 
but finite N. Specifically, we discuss the optimal parameters and corresponding error bars that are 
found by sampling 9 from posterior distributions of the form 

p( 6 *|data) ~ ( 35 ) 

for the following choices of the objective function F{9)-. 

(a) F(9) = L{9,tt*) is likelihood computed using the correct noise function tt*. 

(b) F{9) = L{9,tt') where tt' differs from tt* by a small but arbitrary error. 

(c) F{9) = L{9,tt") where tt" differs from tt* by a small amount along a dual mode. 

(d) F{9) = I{9) is the mutual information between measurements and model predictions. 

(e) F{9) = Lna{9) is the noise-averaged likelihood. 

To streamline notation, we will use (•) to denote averages computed in multiple different contexts. In 
each case, the appropriate context will be specified by a subscript. As above (-(data denote averaging 
over a specific data set {>S'„, will indicate averaging over an infinite number of data set 

realizations. {•)g, (’)s\rm respectively denote averages over the distributions 

p{S), p{S,M), p{S\R), and p{S\R,M), the empirical distributions obtained in the infinite data limit. 
{■)g will indicate an average computed over parameter values 9 sampled from the posterior distribution 
p(0|data). Subscripts on cov(-) or var(-) should be interpreted analogously. 
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(a) 

likelihood (b) 

likelihood (c) 

likelihood 


with the correct 

with an incorrect 

with an incorrect 


noise function 

noise function 

noise function 



(arbitrary error) 

(dual mode error) 


. i 




^(eidata) ~ 

p{e\data) ~ 

p(6l|data) - 

yUOU 

ynon 

yuon 



• 

• 

- ^ - 


0 

e 

6 



ydlf 



mutual information 

p(6>|data) ~ 


(e) 


noise-averaged likelihood 


p(6l|data) ~ 


,,dif 


.,dif 


Fig. 7: Posterior distributions on model parameters resulting from various objective functions. Each 
panel schematically illustrates the posterior distribution p(6i|data) (gray shaded area) as it relates to the 
correct model 9* (dot) along both diffeomorphic (abscissa) and nondiffeomorphic (ordinate) directions 
in parameter space, (a) Likelihood with the correct noise function tt* leads to a posterior distribution 
consistent with 9* in all parameters, (b) Likelihood with a noise function tt' that differs arbitrarily 
from TT* will, in general, lead to a posterior distribution that is inconsistent with 9* along both diffeo¬ 
morphic and nondiffeomorphic modes, (c) Likelihood with a noise function tt" that differs from tt* only 
along a dual mode leads to a posterior that is inconsistent with 9* along the diffeomorphic mode 
(parallel to dashed line), but consistent with 9* in all other directions (perpendicular to dashed 
line), (d) Using mutual information give a posterior that is consistent with 9*; this posterior places 
constraints similar to likelihood along non-diffeomorphic modes but places no constraints whatsoever 
along diffeomorphic modes, (e) Using noise-averaged likelihood results in a posterior distribution sim¬ 
ilar to mutual information but with weak constraints on diffeomorphic modes resulting from the noise 
function prior pin ). 


7.1 Likelihood 


Consider Eq. (35) with F{9) = L{9,tt*) at large but finite N. The posterior distribution p(0|data) will, 
in general, be maximized at some choice of parameters 9° that deviates randomly from the correct 
parameters 9*. At large N, p(0|data) will become sharply peaked about 9° with a peak width governed 
by the Hessian of likelihood; specihcally 



cove{9,-9°,9^-9°) 


(36) 
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where 

, (37) 

e* 

is the Hessian of the likelihood. It is also readily shown (see Appendix A) that this peak width is 
consistent with the correct parameters 8*, in the sense that 


H^J = 


d^L{0,TT*) 


88 , 88 , 


cov,eai(0r - 8°, 8* - 8°) = coYeid, - 8°, 8, - 0°). 


In Appendix A we show that the Hessian of likelihood, Eq. (125), is given by 




(38) 


(39) 


where 


J{R) = '£7r*{M\R) 

M 


8\ogTT*{M\R) 

M 


J2^*{m\r) 

M 


8^logTr*{M\R) 

9^2 


(40) 


is the Fisher information of the noise function tt*. This Fisher information is a nonnegative measure 
of how sensitive our experiment is in the vicinity of We thus see that, as long as the set of 
vectors 8R/88i spans all directions in parameter space, the Hessian matrix Hij will be nonsingular. 
Using F{8) = L{8,'k*) will therefore put constraints on all directions in parameters space, and these 
constraints will shrink with increasing data as This situation is illustrated in Fig. 7a. 

Now consider what happens if instead we use a noise function tt' that deviates from tt* in a small 
but arbitrary way. Specifically, let 

log7r'(M|i?) = log7r*(M|i?) + e/(M, R) (41) 


for some function /(M, R) and small parameter e. It is readily shown (see Appendix A) that the 
maximum likelihood parameters 8' will deviate from 8* by an amount 




Uj, 


where 


8f 8R\ 

mWJs 


(42) 


This expected deviation does not depend on N and will therefore not shrink to zero in the large N 
limit. Indeed, for any choice of e > 0, there will always be an N large enough such that this bias in 8' 
dominates over the uncertainty due to finite sampling. 

Is there any restriction on the types of biases in 8' that can be produced by the choice of incorrect 
noise function -k'I In general, no. Because the Hessian matrix H is nonsingular, one can always find a 
vector w such that the deviation of 8' from 8* 

As long as the functions 

9ti.R) = 

are linearly independent for different indices i, 
vector w. 

We therefore see that arbitrary errors in the noise function will bias the inference of model parame¬ 
ters in arbitrary directions. This fact presents a major concern for standard likelihood-based inference: 
if you assume an incorrect noise function tt, the parameters 8 that you then infer will, in general, be 
biased in an unpredictable way. Moreover, the magnitude of this bias will be directly proportional to 
the magnitude of the error in the log of your assumed noise function. This problem is illustrated in 
Fig. 7b. 

® In what follows we assume that J(R) > 0 almost everywhere. This just reflects the assumption that our 
experiment actually does convey information about R through the measurements M it provides. 


in Eq. (42) points in any chosen direction of d-space. 


8R\ 

S\R 


(43) 


a function / can always be found that generates the 
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There is a case that deserves some additional consideration. Suppose we use a noise function tt" 
that differs from tt* only along a dual mode i.e., 

log tt" (MI i?) = log tt*{M\R) + R). (44) 

The maximum likelihood parameters 9” of L{9, tt") will still deviate from 9* by an amount that does 
not shrink to zero in the TV —>■ oo limit. However, this bias in parameter values will be restricted to 
the diffeomorphic mode to which is dual, i.e., 


{ 0 ” - ^n^eal = 


dif 


(45) 


This state of affairs ain’t so bad since the incorrect noise function will lead to model parameters that 
are inaccurate only along modes that we already know we cannot learn from the data. This situation 
is illustrated in Fig. 7c; see Appendix A for the derivation of Eq. (45). 


7.2 Mutual information 


The constraints on parameters imposed by using mutual information 1(9) as the objective function 
F{9) in Eq. (35) are determined by the Hessian 


_ d^ije) 

89,89, 


(46) 


Appendix B provides a detailed derivation of this Hessian, which after some computation is found to 
be given by 


K,j = - dRp{R)J{R) 


8R8R\ _/9R\ 

89i 89j / \ 89i / \ 89j / 


e- 


Comparing Eq. (47) and Eq. 


we see that for any vector v in parameter space, 

- ^ HijViVj > - ^ KijViVj > 0 . 


(47) 


(48) 






Likelihood is thus seen to constrain parameters in all directions at least as much as mutual information 
does. As expected, mutual information provides no constraint whatsoever in the direction of any 
diffeomorphic mode of the model, since 


= / dRp{R)J{R) [(/i"(A))^|^ - {h{R))%g 


= 0 . 


* J 


9- 


(49) 


The converse is also true: if there is no constraint on parameters along v, then v must be a diffeomorphic 
mode. This is because 


- X] = f dRp{R) J{R) var f ^ j 


S\R 


(50) 


e* 


Because J(R) is positive almost everywhere, the right hand side of Eq. (50) can vanish only if 
does not differ between any two sequences that have the same R value. 1‘here must therefore exist a 
function h(R) such that h{R) = sequences S. This is precisely the requirement in Eq. 

(16) that u be a diffeomorphic mode. 

However, except along diffeomorphic modes, we can generally expect that the constraints provided 
by likelihood and by mutual information will be of the same magnitude. This situation is illustrated 
in Fig. 7d. Indeed, in the next section we will see an explicit example where all nondiffeomorphic 
constraints imposed by mutual information are the same as those imposed by likelihood. 

Before proceeding, we note that the relationship between the Hessians of likelihood and mutual 
information suggests an analogy to fluid mechanics. Consider a trajectory in parameter space given by 
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No library 
sequences 

M = 0 



Gaussian sequence 
distribution 



enrichment experiment 

~ exp(ti*_R*) 


Ni selected 
sequences 



M = 1 


Fig. 8: Illustration of the Gaussian selection model of a massively parallel experiment. Each assayed 
sequence in this model is a U-dimensional vector. The library (corresponding to bin M = 0) consists 
of Nq sequences S drawn from a Gaussian distribution piib(>S') that is centered on a specific sequence 
fj,. Bin M = 1 consists of Ni sequences drawn from the distribution piib(>S') then enriched by a factor 
of exp{b*R*) where R* = S^9*. This enrichment procedure is analogous to selecting protein-bound 
DNA sequences where b*R* is negative the binding energy. Galculations in the text are performed in 
the Nq ^ Ni limit. 


= tvi, where t is time and u is a velocity vector pointing in the direction of motion. This motion in 
parameter space will induce a motion in the prediction R(t) that the model provides for every sequence 
S. The set of sequence {£'„} thus presents us with a dynamic cloud of “particles” moving about in 
i?-space. At t = 0, the quantity will be proportional to the average kinetic energy of particles 

at location R. The quantity will be proportional to the (per particle) kinetic energy of the bulk 

fluid element at i?, a quantity that does not count energy due to thermal motion. In this way we see 
that — j HijViVj is a weighted tally of total kinetic energy, whereas — ^ KijViVj corresponds to 

a tally of internal thermal energy only, the kinetic energy of bulk motion having been subtracted out. 


7.3 Noise-averaged likelihood 


Noise-averaged likelihood provides constraints in between those of likelihood, computed using the cor¬ 
rect noise function, and those of mutual information. This is illustrated in Fig. 7e. Whereas mutual 
information provides no constraints whatsoever on the diffeomorphic modes of 0, noise-averaged like¬ 
lihood provides weak constraints in these directions. These soft constraints reflect the Hessian of A(9) 
in Eq. (12). The constraints along diffeomorphic modes, however, have an upper bound on how tight 
they can become in the iV —>■ oo limit. This is because such constraints only reflect our prior p{tt) on 
the noise function, not the information we glean from data. 


8 Worked example: Gaussian selection 

The above principles can be illustrated in the following analytically tractable model of a massively 
parallel experiment, which we call the “Gaussian selection model.” In this model, our experiment 
starts with a large library of “DNA” sequences S, each of which is actually a D-dimensional vector 
drawn from a Gaussian probability distributiorj^ 

Piib(5') = (27r)"'°/2exp ^ . (51) 

® For the sake of simplicity we set the covariance matrix of this distribution equal to the identity matrix. 
The more general case of a non-identity covariance matrix yields the same basic results. Also, we note that, 
while approximating discrete DNA sequences by continuous vectors might seem crude, it is only the marginal 
distributions p{R\M) that matter for the inference problem. Most of the quantities R that one encounters in 
practice are computed by summing up contributions from a large number of different nucleotide positions. In 
such cases, the marginal distributions p{R\M) will often be nearly continuous and virtually indistinguishable 
from the marginal distributions one might obtain from a Gaussian sequence library. 
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Here, /r is a Z?-dimensional vector defining the average sequence in the library. From this library we 
extract sequences into two bins, labeled M = 0 and M = 1. We fill the M = 0 bin with sequences 
sampled indiscriminately from the library. The M = 1 bin is filled with sequences sampled from this 
library with relative probability 


P{M = 1\S) 
p{M = els') 


= exp(a* + b*R*) 


(52) 


where the activity R* is defined as the dot product of S with a H-dimensional vector 0*, i.e., 


R* = s’^e*. 


(53) 


We use Nm to denote the number of sequences in each bin M, along with N = Nq -\- Ni. 

All of our calculations are performed in the limit where Ni is large but for which Nq is far larger. 
More specifically, we assume that exp(a* + 6*i?*) << 1 everywhere that bothp(5'|M = 0) andp(S'|M = 
1) are significant. We use e to denote the ratio 


p{M = 1) _ iVi 
p{M = 0) “ Ao’ 


(54) 


and all of our calculations are carried out only to first order in e. This model experiment is illustrated 
in Fig. 

Our goal is this: given the sampled sequences in the two bins, recover the parameters 6* defining 
the sequence-function relationship for R*. To do this, we adopt the following model for the sequence- 
dependent activity R: 


R = (55) 

where 9 is the Z?-dimensional vector we wish to infer. From the arguments above and in [28], it is 
readily seen that the magnitude of d, i.e. |0|, is the only diffeomorphic mode of the model: changing 
this parameter rescales R, which preserves rank order. 


8.1 Bin-specific distributions 

We can readily calculate the conditional sequence distribution p{S\M) for each bin M, as well as the 
conditional distribution p{R\M) of model predictions. Because the sequences sampled for bin 0 are 
indiscriminately drawn from pub, we have 


p(S'|M = 0) = piib(5') = (27r) ^/^exp ^ (56) 

The selected distribution of sequences is found to be 

p(S\M = 1) = (2^)-^/2exp . (57) 

The value of e is found to be related to a*, b*, and 9* via 

e = exp -h b*f/'9* + ^ ^ \ . (58) 


Appendix C provides an explicit derivation of Eq. (571 and Eq. 

We compute the distribution of model predictions for each bin as follows. Eor each bin M, this 
distribution is defined as 


p{R\M)= dSSiR-9'^S)p{S\M). 


(59) 
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This can be analytically calculated for both of the 
distribution. We find that 


p{R\M = 0) 
p{R\M = 1) 


V^\0\ 


bins owing to the Gaussian form of each sequence 

(60) 

2|6i|2 J ’ ^ ^ 

(R-[p + b*e*fer \ 

2|d|2 )■ ^ ’ 


See Appendix C for details. 


8.2 Noise function 


To compute likelihood, we must posit a noise function tt{M\R). Based on our prior knowledge of the 
selection procedure, we choose Tr{M\R) so that 


7r(M = l|i?) 

tt{M = 0\R) 


exp(a + bR), 


(62) 


where a and b are scalar parameters that we might or might not know a priori. This, combined with 
the normalization requirement, gives 

a+bR 1 

This noise function tt is correct when a = a* and b — b*. The parameter b is dual to the diffeomorphic 
mode |d|, whereas the parameter a is not dual to any diffeomorphic mode. 

In the experimental setup used to motivate the Gaussian selection model, the parameter a is affected 
by many aspects of the experiment, including the concentration of the protein used in the binding assay, 
the efficiency of DNA extraction from the gel, and the relative amount of PGR amplification used for 
the bin 0 and bin 1 sequences. In practice, these aspects of the experiment are very hard to control, 
much less predict. From the results in the previous section, we can expect that if we assume a specific 
value for a and perform likelihood-based inference, inaccuracies in this value for a will distort our 
inferred model 9 in an unpredictable (i.e., nondiffeomorphic) way. We will, in fact, see that this is the 
case. The solution to this problem, of course, is to infer 9 alone by maximizing the mutual information 
7(0); in this case the values for a and b become irrelevant. Alternatively, one can place a prior on a 
and b, then maximize noise-averaged likelihood Lna,{9). We now analytically explore the consequences 
of these three approaches. 


8.3 Likelihood 


Using the noise function in Eq. (631, the likelihood L becomes a function of 9, a, and b. Computing L 

62 | 6 »| 2 ' 


in the iV —^ oo and e —>■ 0 limits, we find that 


L{9, a, b) = e[a -I- b9^ p + bb*9^ 9*] — exp I a + b9^ p + 


(64) 


We now consider the consequences of various approaches for using L{9, a, b) to estimate 9*. In each 
case, the inferred optimum will be denoted by a superscript ‘o.’ Standard likelihood-based inference 
requires that we assume a specific value for a and for b, then optimize L(9, a, b) over 9 alone by setting 


0 = 




(65) 


6° ^a,b 


for each component i. By this criteria we find that the optimal model 9° is given by a linear combination 
of 9* and p: 


cb* 

= — 9 
b 


c — I 


-M, 


( 66 ) 
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where c is a scalar that solves the transcendental equation 

c = exp ^[a* — a] H-^— \b*0* + . (67) 

See Appendix C for the derivation of this result. Note that c is determined only by the value of a and 
not by the value of b. Moreover, c = 1 if and only if a = a*. 

If our assumed noise function is correct, i.e., a = a* and b = 6*, then 


0 ° = 9 *. ( 68 ) 

Thus, maximizing likelihood will identify the correct model parameters. This exemplifies the general 
behavior illustrated in Fig. 

If a = a* but b ^ b*, our assumed noise function will differ from the correct noise function only in 
a manner dual to the diffeomorphic mode |0|. In this case we find that c = I and 



9° is thus proportional but not equal to 9*. This comports with our claim above that the diffeomorphic 
mode of the inferred model, i.e. |0°|, will be biased so as to compensate for the error in the dual 
parameter b. This finding follows the behavior described in Fig. [^. 

If a 7 ^ a*, however, c ^ 1. As a result, 9° is a nontrivial linear combination of 9* and /r, and will 
thus point in a different direction than 9*. This is true regardless of the value of b. This behavior is 
illustrated in Fig. [^: errors in non-dual parameters of the noise function will typically lead to errors 
in nondiffeomorphic parameters of the activity model. 

We now consider the error bars that likelihood places on model parameters. Setting 9 = 9° + 59 
and expanding L{9,a,b) about 0°, we find that 

NL{9,a*,b*) « NL{9°,a*,b*) -^ ^ (70) 

i,3 


where Aij = 5ij + + b*9*){fij -|- b*9*). Note that all eigenvalues of A are greater or equal to 1. 

Adopting the posterior distribution 

p(0|data) - (71) 


therefore gives a covariance matrix on 9 of 


{ 69 , 59 ^} 



Nib*"^' 


(72) 


Thus, 59 ^ ' in all directions of 0-space. Although Aij will change somewhat if a and b deviate 

from a* and b*, this same scaling behavior will hold. Therefore, when the noise function is incorrect 
and N is sufficiently large, the finite bias introduced into 0° will cause 9* to fall outside the inferred 
error bars. 


8.4 Mutual information 

In the e —>■ 0 limit, Eq. Q simplifies to 




(73) 


The lowest order term on the right hand side can be evaluated exactly using Eq. (60) and Eq. (611: 

e 6*2 ( 0 T 0*)2 


7(0) = 




(74) 
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See Appendix C for details. Note that the expression on the right is invariant under a rescaling of 0. 
This reflects the fact that \9\ is a diffeomorphic mode of the model dehned in Eq. (55). 

To find the model 0° that maximizes mutual information, we set 


0 = 




^b^20oT0* 


9 ° 


\0 


o|2 


0 * - 01 


noT0* 


\0^ 


o\2 


The optimal model 9^ must therefore be parallel to i.e. 

9° oc 0*. 

Expanding about 9 = 9° + 60 as above, we find that 

Af 

NI{9) = NI{9°) -^— {50A_f 


(75) 


(76) 


(77) 


where 50 is the component of 50 perpendicular to 0*\ see Appendix C. Therefore, if we use the 
posterior distribution p(d|data) ~ to infer d, we find uncertainties in directions perpendicular to 

0* of magnitude N-^ ' . These error bars are only slightly larger than those obtained using likelihood, 
and have the same dependence on N. However, we find no constraint whatsoever on the component of 
50 parallel to 9*. These results are illustrated by Fig. [7}i. 


8.5 Noise-averaged likelihood 


We can also compute the noise-averaged likelihood, Lna{0), in the case of a uniform prior on a and b, 
i.e. pItt) = p{a, b) = C where C is an infinitesimal constant. We find that 


exp[iVLna(6')] = J dTTp{TT)exp[NL{0,TT)] 



See the Appendix C for details. Thus, 


(78) 


N exp 


a + b9^p + 



(80) 


Ln.{0) = m 


1 

N 


log |0| -I- const, 


(81) 


where the constant (which absorbs C entirely) does not depend on 0. If we perform Bayesian inference 
using noise-averaged likelihood, i.e., using p(0|data) ^ therefore find in the large N 

limit that 69± is constrained in the same way as if we had used mutual information. The noise function 
prior we have assumed further results in weak constraints on |0| that do not tighten as N increases!^ 
This is represented in Fig.[^. 


9 Discussion 

The systematic study of quantitative sequence-function relationships in biology is just now becoming 
possible, thanks to the development of a variety of massively parallel experiments. Concepts and 
methods from statistical physics are likely to prove valuable for understanding this basic class of 
biological phenomena as well as for learning sequence-function relationships from data. 

In this paper we have discussed the problem of learning parametric models of sequence-function 
relationships from experiments having poorly characterized experimental noise. We have seen that 

In the case at hand, \6°\ is pushed all the way to zero. This is an artifact of the simple flat prior p{a, b). If 
we instead adopt a weak Gaussian prior on b, we can still carry out the computation of Lna analytically, and 
in this case we find that \9°\ is finite. 
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standard likelihood-based inference, which requires an explicit model of experimental noise, will gen¬ 
erally lead to incorrect model parameters due to errors in the assumed noise function. By contrast, 
mutual-information-based inference allows one to learn parametric models without having to assume 
any noise function at all. Mutual-information-based inference is unable to pin down the values of model 
parameters along diffeomorphic modes. This behavior reflects a fundamental difference between how 
diffeomorphic and nondiffeomorphic modes are constrained by data. Diffeomorphic modes arise from 
arbitrariness in the distinction between the activity model and the noise function. These findings were 
illustrated using an analytically tractable model for a massively parallel experiment. 

The study of quantitative sequence-function relationships still presents many challenges, both theo¬ 
retical and computational. One major practical difficulty with the mutual-information-based approach 
described here is the difficulty of accurately estimating mutual information from data. Although meth¬ 
ods are available for doing this |44j . it remains unclear whether any are accurate enough to enable 
computational sampling of the posterior distribution p(0|data) ~ as suggested here. Moreover, 

none of these estimation methods is regarded as definitive. We believe this lack of clarity regarding 
how to estimate mutual information reflects the fact that the density estimation problem itself has 
never been fully solved, even in one or two dimensions. We are hopeful, however, that field-theoretic 
methods for estimating probability densities might help resolve the problem of estimating 

low-dimensional probability densities as well as estimating mutual information. 

The problem of model selection poses a major theoretical challenge. Ideally, one would like to 
explore a hierarchy of possible model classes when fitting parametric models to data. However, when 
considering effective models it is unclear how to move far beyond independent site models (e.g., energy 
matrices) due to the number of parameters growing exponentially with the length of the sequence. 
Alternatively, when learning mechanistic models such as the model of the lac promoter featured in 
Fig. 3, it is unclear how to go about systematically testing different arrangements of binding sites, 
different protein-protein interactions, and so on. We emphasize that this model prioritization problem 
is fundamentally theoretical, not computational, and as of now there is little clarity on how to address 
this matter. 

Finally, the geometric structure of sequence-function relationships presents an array of intriguing 
questions. For instance, very little is known (in any system) about how convex or glassy such land¬ 
scapes in sequence space are, what their density of states looks like, etc.. Most of the biological and 
evolutionary implications of these aspects of sequence-function relationships also have yet to be worked 
out. We believe that the methods and ideas of statistical physics may lead to important insights into 
these questions in the near future. 


10 Appendix A: Maximum likelihood under various noise functions 

At the correct noise function tt*, likelihood is given by 


L{e,n*) = 


(82) 


Taylor expanding this quantity about 0* gives 


or 

L{9,nn=L{9*,n*) + J2 ^ 




d^L 




d9id9j 


{9. - 9*){9, - 9* 


(83) 


We define the random vector u in terms of the coefficient of the linear term of this expansion: 


Ui _ dL 
vV “ 89 ^ 


/ d\ogTT*{M\R) dR 


^data 6* 


e 


dR 


(84) 
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Because Ui/\/~N is defined as a sum of N random terms, and because the mean of these terms vanishes, 
the covariance will, by the central limit theorem, be given by 




51og7r*(M|i?) 


dR 

= J2f dRp{M,R) 


M 


dR dR\ 

/ 

/ S,M 

d\ogTT*{M\R) 


OR 


dR dR\ 

dOi dOj / 


(85) 


( 86 ) 


At 0 = 0*, Each measurement M will provide no additional information about S beyond that 
provided by the model prediction R = 0{S). Mathematically this means that 


piS\R,M)\,, = piS\R)\,, 


(87) 


for all S', R, and M. Equivalently, the conditional expectation value of any sequence-dependent function 
/(S) will obey 


ifiS)) 


S\R,M 


— {fiR))s\R 


for all M. We use this fact to simplify Eq. ( 861 : 


= J‘‘Rp{R) 




aiog7r*(M|i?) 


dR 


( 88 ) 

(89) 

(90) 


where J{R) is the Fisher information from Eq. (40). 
We compute the Hessian of likelihood as follows: 


dOidOj 


_ / d^logTr*{M\R) dR dR\ 

e* \ ^^3 / SM 


l^logTT*{M\R) d'^R 


dR 


dOidOj / g 


e* 


The second term on the right hand side vanishes because of Eq. ( | 88 [ ) : 

aiog7r*(M|i?) / d'^R 


l^\ogT:*{M\R) d'^R 


dR 


dOidOj / g 


= J2 dRp{R,M)- 

0* M d 

= J dRp{R) ^ 

= J dRp{R) 

= J dRp{R) 

= 0 . 


dR 


dOidOj! g^g^ j^ 


dOidOj / 

d^R 


dOidOj! 


d'^R 


^7r(M|i?) 

. M 

\d 

dR 


d\ogTT{M\R) 


dR 


;^7r(M|S-) 


M 


dOidOj / 


Al¬ 

ai? 


We therefore find that 


= E / dRp{R, M) 

M d 


d^\og-K*{M\R) 


i?2 


= -/"”(«)■'(«) (if),,. 


(91) 

(92) 

(93) 

(94) 

(95) 

(96) 

(97) 

(98) 
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which is Eq. (39). Note that, from Eq. (90), = —Hij. 

The optimum 9° of L{6,tt*) will occur when 


0 = 


dL{9,Tr*) 


do 




We therefore find that, to lowest order in N 

The covariance of 9° is thus given by 




H- 


k,l 


real ZJ~^ _ _ 

N ~ TV ’ 


which is Eq. (38). 


Under the incorrect noise function tt' defined in Eq. (41), 
L{9y)=L{9,^*) + e{f{M,R))^^,^ 
«L(r,^*) + e(/(M,i?))5|,. 


+ E 


.y/N 


{0^ - 0*) + -Y, U,,(9^ - 9l)(9, - 9*) + 


where 


Wi = 




^ OR d9i / g 

Let 9' denote the maximum of L{9,tt'). Setting ^ = 0 


, we find 


(99) 

( 100 ) 

( 101 ) 

( 102 ) 

(103) 

(104) 




.Vn 


(105) 


from which we get Eq. (421. 

In the case of a noise function tt" that differs from tt* only along a dual mode, as in Eq. (44), the 
vector Wi is given by 


/dv^'^^ dR\ 

“ \ ^Wj, 


The maximum likelihood parameters 9" will therefore satisfy 


(106) 


(107) 


_^^d‘^\ogTTdRdR ^dlog-K d^R 


9i?2 d9, d9j 


dR d9id9j / gj^ 


(«;' - »;> 


3 / real 


+e 


dR d 

WidR 

d dlogn 
Wi dR 


dlogTT 

dR 


h{R) 


S,M 


e=e* 


dR 




S,M 


d 5 log TT dR 
Wi dR 


Eg) («'-";>„.,+<)) 


)=e* 


S,M 


9 = 9 - 


(108) 

(109) 

( 110 ) 
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which is solved by Eq. (45). The fact that this uniquely specifies {O'l 
H being nonsingular. 


^Dreai follows from the Hessian 


11 Appendix B: Gradient and Hessian of mutual information 

Here we calculate the gradient and Hessian of mutual information evaluated at 6 = 6*. We do this by 
first computing derivatives of the empirical probability distributions p{R) and p{R, M) with respect to 
model parameters. The mathematical trick used to do this is adapted from [3T]. These results are first 
applied to likelihood in order to demonstrate their use and correctness. We then use this approach to 
compute the gradient and Hessian of mutual information. To clarify these derivations, we use r(0, S), 
instead of 9{S), to explicitly denote the model prediction R as a function of sequence S and model 
parameters 9. We also define di= and use / dS to represent sums over sequences. 


11.1 How the distribution of model predictions changes with model parameters 
The empirical probability distribution of model predictions R is given by 

p{R) = J dSp{S)S{R-r{9,S)). (Ill) 

The gradient of this probability distribution with respect to model parameters is computed as follows: 


d,piR)= dSp{S)d,diR-r{9,S)) 


= - j dSp{S) 


-^S{R-r{9,S)) 


A 

d r 


p{R) dSp{S\R)6{R-r{9,S))dir 


OR E 

Similarly, the Hessian of p{R) is given by 


d,d,p{R) = / dSp{S) 


J dSpiS) I 


ai?2 


d{R-r{9,S)) 


dirdjT — 


—SiR-r{9,S)) 


d^djT 


92 


p{R) 


A 

m I 


p{R) {didjr)g^j^ 


Analogous results follow for the gradient and Hessian of the joint distribution p{R, M): 


d r 


d,p{R,M) = - ^ \p{R,M) 
didjpiR,M) = p{R,M) {d^rdjT) 


dR 


p{R,M) 


( 112 ) 

(113) 

(114) 

(115) 

(116) 

(117) 

(118) 
(119) 


11.2 Gradient and Hessian of likelihood 

Likelihood can be expressed in terms of the empirical distribution p{R,M) as 

L(0,7r) = ^^ j dRp{R,M)\ogTr{M\R). 

M d 


(120) 
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Keep in mind that R is just a dummy variable in this integral; the empirical distribution p is the only 
quantity that depends on 6. The gradient of likelihood is therefore computed as 


diL = / dR [dip{R, M)] log n{M\R) 


M 


J M) {dir) |log7r(M|i?) 


M 

= J2 [dRpiR,M) 

/aiog7r(M|i?)^ 

“ \ dR 


d\ogTT{M\R) 

m 


{dir) 


S\R,M 


( 121 ) 

( 122 ) 

(123) 

(124) 


S,M 


Note that in going from Eq. (122) to Eq. (1231 we used integration by parts. The Hessian of likelihood 
is computed similarly: 


didjL = ^2 / dR[didjp{R, M)]\ogTT{M\R) 


(125) 


M 


log7r(M|i?) p{R, M) {d^rdjr)p{R, M) {didjr )| (126) 


= ^ / inm, M) I g „ 

Af 1 


31og7r(M|ii) 

M 


{d^djr) 


j’/S\R. 


,m| • 


(127) 


This expression is valid for all choices 0 and tt. 

Restricting our attention now to 6 = 6* and tt = tt*, we see that the second term in Eq. (1271 
vanishes as it did in Eq. (92) through Eq. (96). Moreover, the first term gives 


didjL = - j dRp{R)J{R) {dirdjr ), 


(128) 


which is the formula obtained for Hij in Eq. (39). 


11.3 Gradient and Hessian of mutual information 

The gradient and Hessian computations for mutual information are simplified by expressing mutual 
information in terms of its component entropies. We write 

i{e) = HR{e) + HM-HRM{e) ( 129 ) 


where 


Hrm{ 0) = -J2 dRp{R,M)logp{M,R), 

M 

Hr{0) = - J dRp{R) logp{R), 

Hm = - p{M) logp{M). 

M 


(130) 

(131) 

(132) 
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The gradient of Hu is given by 

diHn = - J dR [dip{R)] logp{R) - J dRp{R)di \ogp{R) 
= - j dR [dip{R)] logp{R) - J dRp{R)^^^dip{R) 
= - J dR [dip{R)] log p{R) -Oil 
= - J dR[dip{R)] logp{R). 


Similarly, 


diHRM = - [ dR[dip{R,M)]\ogp{R,M). 

M 


Hm doesn’t depend on 9, so diHM = 0. The resulting gradient of mutual information is 

dj = '^ dR [d^p{R, M)] logp(i?, ^) - J logp(i?) 

p{R,M) 


M 


= ^JdR [d,p{R, M)] log 
= ^ /" dR[d^p{R,M)]logp{M\R). 


M 


Note from Eq. (1211 that did = diL whenever tt(M\R) = p{M\R). 

Now let’s compute the Hessian of Hu: 

d^djH[i = - [ dR [didjp{R)] logp{R) - [ dR[d^p{R)]di logp{R) 


= - dR[didjp{R)]logp{R) - / dRp{R)[d^logp{R)][djlogp{R)]. 


Similarly, 


didjHRM = - ^ dR ldidjp(R, M)] logp{R, M) 

M d 

“X] / dRp{R,M)[dilogp{R,M)\[djlogp{R,M)\. 


M 


The Hessian of mutual information is therefore given by, 

didji = didjHR - d^djHfiM- 


Using the form of didjL in Eq. (1251, we see that this reduces to 


where 


and 


d,d,i = d,d,L + 


Ay = / dRp{R)[dilogp{R)][djlogp{R)] 


Ay" = E / dRp{R,M)[d,logp{R,M)][d,logp{R,M)]. 

M 


(133) 

(134) 

(135) 

(136) 

(137) 

(138) 

(139) 

(140) 


(141) 

(142) 


(143) 

(144) 

(145) 

(146) 

(147) 
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We now split Afj and into four terms each. For Afj we get 


A^ = 

V 


d 


p{R) OR 

^ ^ ia,r) 


where 


_ aR I tdR I tdR I /^R 

“T ^ij “T ^ji “r ^ij 5 


A^j= J dRp{R) 
J dRp{R) 
= J dRp{R) 


dlogpiRy"^ 

M 

91ogp(i?) 

M 

dR 




{9^r)s\R ^ (djr) 


dR ’ 




Similarly, 

where 


= A^f + + Bf^ + 


AE = E / dRp{R,M) 

^*E = E / dRp{R,M) 

aiogp(i?, M) 

ai? 

aiogp(i?, M) 

dR 

= [ dRp{R,M) 

M 





S|i^,M ’ 


d 


d 


dR 


j(149) 

(150) 

(151) 

(152) 

(153) 

(154) 

(155) 

(156) 

(157) 


It is unclear how to simplify the 


^5"" = E / dRpiR.M) [A 

M L 

= J dRp{R) ^ (a*r)s| 


1 cnoices or w. At w w , nowever, tne 
1 causes a lot of cancellations to occur: 


d 


dR 


d , 


_ /oi? 

— ^ij 


and 


Bf^^ = j dRp{R) 

= J dRp{R) 

= J dRp{R) 

= J dRp{R) 
= J dRp{R) 

_ tdR 

— 


_ M 

^ p{M\R) dp{R,M) 
^^piR,M) dR 

1 dp{R) 
p{R) dR 
d\ogp{R) 




dR 
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s\R ^ (^i^)s|i? 


M 




i9^r)s\R ^ (djr) 


dR 
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U 


(158) 

(159) 

(160) 

(161) 

(162) 

(163) 

(164) 

(165) 

(166) 
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We therefore find that, 

Af^ — Af-j = Arm — Ar 


= / dRp{R) (dir)g^j^{djr) 


M 


d log p{R,M) 


dR 


d log p{R) 

M 


(167) 

(168) 


The expression in braces can be simplified as follows: 


Y,p(.M\R) 
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i91ogp(i?, M) 

2 

dlogp(i?) 

dR 


dR 


= ^p(M|i?) 

M 

= J2p{m\r) 


dlogp{M\R) ^ dlogp{R) 


dR 


OR 


d log p{R) 


M 


i91ogp(M|i?) 


1 2 


dR 
1 dp{R) d 


+ 


d logp{R) d logp{M\R) I 


dR 


dR 


= J{R). 

The Hessian of mutual information at 6* = 6** therefore has a rather simple form: 

arm aR - l'dRp{R)J{R) {dirdjr)g^j^-{dir)s^j^{djr)g^j^ 


R-ij — R-ij 


A^^-Ai^ = - 


K169) 

(170) 

(171) 

(172) 

(173) 


which is Eq. (47). 


12 Appendix C: Gaussian selection model 

12.1 Derivation of Eq. 

Applying Bayes’s theorem twice, 

p{s\M =1) = r ^\^? p{s) = ^^L pis\M= o). 


p{M = 1) 


Using Eqs. 56 52 and 54 then gives 


p{M=l) p{M = 0\Sy 

\S-pr 


piS\M = 1) = e-ie“*+*'*^"^*(2^)-^/2exp 


Next we complete the square in the exponent: 


\s-py 

2 




* oTn* 


|S'p + \py + \b*9*y - 2p^S - 2b*S'^9* + 2b*p^e* 




(174) 

(175) 

(176) 

(177) 

(178) 


Erom the first term in Eq. (178) we recover Eq. (57). To get e, we substitute Eq. (178) into Eq. dlT^ . 
Comparing this to Eq. (57) then gives 


1 = e ^e“ exp 


\b*6* 


+ b*p^e* 


(179) 


Solving for e recovers Eq. (58). 
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12.2 Derivation of Eqs. [60|and[6T] 

Here we describe how to compute p{R\M) where R = S. We first consider the case of M = 0. 

p{R\M = 0) = y dSp{S\M = Q)5{R - S'^O) (180) 

= J dSp{S\M = 0)S{[R - p^e] -[S- pfd) (181) 

= j dSp{S\M = 0)S{R' - S'^e) (182) 

where R' = R—p^9 and S' = S — p. We have chosen to work with R' and S' instead of R and S because 

p{S'\M = 0) is centered about 0. Now, split S' up into the components parallel and perpendicular to 

0: 


S' = S'^ + S'J, 


(183) 


where S'j_ is a vector of dimension D — 1 orthogonal to 6, S'y is a scalar, and 9 = 9/\9\. This definition 
gives S'"^9 = S'm| 0|. Continuing with the integration. 


p{R\M = 0) = f dS'j^ f dS'fi S{R' - ^i', \9\){2tt)-^/^ exp 

J J — OO 

nOO / ^^2 \ 

= d5'<5(i?'-,5[||0|)(27r)-i/2expf-^j 




/•OO / R' \ I S', 

I '^'^11 VM ~ I “ 


72 ' 
2~ 


= |0| ^(27r) exp ( — 


R 


n 


m 


Finally, substituting R back for R' gives 

p{R\M = 0) = ^^exp 

^ ^ V^\0\ V 2|0|2 

To compute p{R\M = 1), we just replace p ^ p + b*9*, giving 

1 f iR-[b^ + b*9*f9)^ 
p{R\M = 1) = ^=77 exp ' ^ ‘ ' 


^/^\9\ 


2|0|2 


(184) 

(185) 

(186) 

(187) 

(188) 

(189) 


12.3 Derivation of Eq. (64) 


We compute likelihood in the N —>■ oo limit as follows: 

L{9,a,b) = '^p{M) J dRp{R\M)logTT{M\R) (190) 

r 1 Nt C pa+bR 

J dRp{R\M = 0) log + ^ J dRp(R\M = 1) log (191) 

J dRp{R\M = 0)6"+^^+^ J dRpiR\M =l)[a + bR] (192) 

- + e (a + &7 ?)s|m=i • (193) 


M 

No 

N 
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In deriving Eq. (1931 we assumed that <C 1 for all values of R over which both p{R\M = 0) and 

p{R\M = 1) have significant support. This assumption necessarily holds in the e —>■ 0 limit. We have 


also kept only the lowest order terms in e. Note in particular that order 


The second term in Eq. (193) can be directly read off from Eq. (611: 

(a + bR}j= a b (i?) q = o + bp^0 + bb*6^9*. 


(194) 


From Eq. (60) we see that the first term in Eq. (193) can be computed by completing the square: 
(i? - p'^ef 


2W 


bR = - 


i?2 + - 2{p^9)R - 2b\e\^R 

21012 

i?2 + (^^0)2 + &2|0|4 _ 2{p^e)R - 2b\e\^R + 2(/i'^0)6|0|2 


+6(/r^0) + 


02|0|2 


{R- p^e-b\e\^f 

W 


2 | 0|2 


b{p^0) 


02|0|2 


from which we get 




a + 6(/i^0) + 


62|0|2 


(195) 

(196) 

(197) 

(198) 


Plugging Eq. (194) and Eq. (198) into Eq. (193) gives the formula for L{9,a,b) in Eq. (64). 


12.4 Derivation of Eqs. [M|and[67| 


Here we show how to derive the optimal 0 for L{9, a, b), with a and b fixed. Setting the gradient of L 
with respect to 0 to zero, 


0 = 




= €b{pi + 6*0*) — b{pi + 60°) exp a + bp^ 0° + 




This gives 


Pi + 60° = e{pi + 6*0*) exp —a — bp^ 0° — 


62|0°|2 


= c{p, + b*9*) 


where c is a constant satisfying 


L2|^i 

c = eexp I —a — bp"’'9° -- 


o|2 


6*2|0*|2-62|0°i2 


= exp I [a* — a] + p’ [6*0* — 60°] + 


(199) 

( 200 ) 
( 201 ) 

( 202 ) 

(203) 


We thus find Eq. (66). Note that the right hand side of the above equation depends implicitly on c 
through the valu e of 0°. To eliminate 0° from the equation for c, we let A denote the 0* dependent 
part of Eq. (203), then substitute in Eq. (66): 


A = p'^[b* 9* -b0°] + 


6 * 2 | 0*|2 - 62 | 5 ( o |2 


= p"’'[b*9*{l - c) - (c - 1)^] + 

= (l-c)6*/r^0* + (l-c)|/r|2 + 
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2 2 
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Using 


(c — 1) — c(c — 1) = 1 — c^, and (1 — c)— 


(1 — c)^ 1 — (? 


2 ’ 


we get 


^ = (I - 


1-c^ 


lb*d* + 


We thus find the transcendental equation for c, 


= exp ( [a* - a] + ^ |6*6>* + /rp ) , 


which is Eq. (67). 


(207) 


(208) 

(209) 


( 210 ) 


12.5 Derivation of Eq. (70) 


Erom the expression for likelihood in Eq. (64), we find that the Hessian of likelihood is 

d^L{e,a*,b*) 

9- 


Hij — 


de.dOj 


= [-b*^5,j - {b*n, + b*^l){b*Hj + b*^e*)] exp + b*9^fi + 






where 


Aij = Sij + + b*e*){^ij + b*9*). 


( 211 ) 

( 212 ) 

(213) 

(214) 


makes use of the approximations Ni « eN, which will hold in the e —^ 0 limit, and 


Note: in deriving Eq. (213) we used the expression for e in Eq. (58). The expression in Eq. (70) further 

(215) 


d^L{9,a*,b*) 


d9,d9j 


d^L{9,a\h*) 


d9,d9j 


which will hold in the large N limit. 


12.6 Derivation of Eqs. 73 and 74 


We derive Eq. (73) as follows. To ease notation a bit, we define pm{R) = p{R\AI). 

. Pm{R) 


I[R;M]= f dRp{M,R) log 

A/f — n 1 ^ 
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Because p{M = 1) = e + O(e^), the first term in Eq. (219) is the right hand side of Eq. (73) to lowest 
order in e. We now show that the second term is of order and can therefore be ignored. Up to terms 
of order 


Rearranging this gives 


p{R) = (1 - e)po{R) + epi(R). 
p{R) - epi{R) 


PoiR) = 


1 - e 


Plugging this into the second term of Eq. (219) gives 


1 dRp{R) log 

[ r ^( 

dRp{R) log 

1 + e ^ 


piR) J\ 
pi{R) 


p{R) 


0{e^) 


Eq. (74) is derived as follows: 

I{e) = e ( log 
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p{R\M = 1) 


= e 


2|6'|2 

e 

e 


p{R\M = 0)/ 

{R-p^ef {\R- p^e]-b*e'^e*f 

2 | 0|2 2 | 0|2 

{2[R - p^e]h*e'^e* - 


M=1 


(2[(i?)M=i - p^o]b*e'^6* - {b*e'^e*y 
(2[b*9^0*]b*9^0* - {b*9^9*)‘^) 


2|6l|2 
e6*2 {e^e*f 


12.7 Derivation of Eq. ([T^ 
To derive Eq. ( [77| ), we set 


0 = 9* + S9ii + S9± 


( 220 ) 

( 221 ) 

( 222 ) 

(223) 

(224) 

(225) 

(226) 

(227) 

(228) 

(229) 

(230) 

(231) 


(232) 


where (56*|| is the deviation of 0 in the direction of 9*, and 60± is the deviation perpendicular to 0*. 
This gives 


{0'^9*y 

~W~ 


(|r|2 + 56i'fr)2 


= \0 


\0*\^ + 269^0* + \S0\\\^ + \69a_\^ 
\9*\‘^ + 269^9* + \69ii\‘^ 


\9*\^ + 260^0* + \d0\\\^ + \d0^\^ 


(233) 

(234) 


= \9 


*|2 


1 - 




(235) 

(236) 


= |r|2-|50^|2 + .... 

The result in Eq. ([T^) readily follows by substituting this into the formula for mutual information in 
Eq. (74), then approximating the Hessian of mutual information at 9° by the Hessian at 0*. 
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12.8 Derivation of Eq. [80| 


Here we show how to evaluate the equation, Eq. (79), for the noise-averaged likelihood First, 

interchange the order of integration and define a = a -I- 60^/r. This gives. 




/ OO nOO / 

db / da' exp Ni [o' -I- bb*6^6*] — N exp ( a' -I- 

-OO j —OO L V 

Next, define M = N exp j, u = Me“ , and so e“ = u/M, e“ da' = du/M. This gives 

,\ ATi-l 




= C 


' —OO 
/•OO 


dbe 




exp 


-Me^ 


/•OO 

/ duu^^-^ exp[-u] 

Jo 


M- 


nM 


= Cr{Ni) / dbe^ 

J —OO 

/ OO 

db exp 

-OO 

/ OO 

db exp 

-OO 

/ OO 
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-OO 

'Nib*^ 


Nibb*0^0* - Ni 


62|0|2 
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2bb* 


.0^0 

w 


- V 


Ni\0\‘^ fb*‘^{0^0*)‘^ 

- l,*OT0*- 

“'ll 

2 | 0|4 

[ |0P J 

)\ 


= Cr{Ni) exp 

= cr{Ni) 


/•OO 


27r 


Ni\0\- 


exp 


|0P 

Nib*"^ {0^0 


' db exp 

— OO 

*\2 ' 




b- 


b*0^0^ 


-\ 2N 


W 


which is Eq. (80). 
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